Project Draft

Jeremy James

Introduction

Background

Home Credit is an international consumer finance provider that lends money primarily to those with little or no credit history. It created a Kaggle competition in which participants used machine learning and statistical methods to predict the loan default risk of individuals based on Home Credit's loan applicant data. Although Home Credit had already used machine learning to project default risk, it hoped to improve its predictive ability based on what the Kagglers produced.

Significance

Home Credit serves people who lack a credit history or are unbanked, and who are therefore likely to be viewed as high risk for a loan default even if they are financially responsible and will make the necessary repayments. Home Credit needs to avoid lending to those who will end up unable to complete payments, as a single default can cost more than the gain from several successful loans. At the same time, it wants its services to be as accessible as possible by providing loans to everyone who is truly eligible. Machine learning can help manage this trade-off: an accurate model lets Home Credit extend more loans, because applicants the model identifies as low risk can be approved with greater confidence.

Research Question(s)

The main research question is whether we can create a model that accurately estimates the probability of a loan applicant defaulting. A secondary goal is to identify the features that best predict default risk.

Data

Source

The data was retrieved from the Home Credit Default Risk Prediction competition on Kaggle (https://www.kaggle.com/c/home-credit-default-risk/data). The data and supporting information were contained in 10 files totaling 2.68 GB.

Original Data Overview

The training dataset is contained in application_train.csv, and contains a total of 307511 rows.

Below is a view of the first 5 rows:

There are 122 columns in total, so the dataset view above contains only a fraction of them.

56 of these are categorical, while the rest are numerical.

I will also be using bureau.csv and bureau_balance.csv, files containing data on the applicant's existing and past loan balances at other institutions. The data I am interested in is in bureau_balance.csv, so while bureau.csv contains several columns, I only need its two ID columns. These will allow me to match the data in application_train.csv to bureau_balance.csv. Here is a view of the first 5 rows of bureau_balance:

This is a large dataset, as it contains one record per month for every existing or previously closed loan held by the applicants.

Variables

Target

Our target variable is whether the loan applicant defaulted. It is a boolean value, with 0 representing no default and 1 signaling an applicant who ended up defaulting. The target is imbalanced, as there are almost 10 times more successful loans than defaults.

Personal

Several variables contain information about the individual applying, including demographics like gender and education status, and miscellaneous information like whether the person owns a home or car. These variables are mainly boolean or categorical.

Loan

This group of variables describes the loan itself, such as the amount of money being borrowed, when the approval process began, and what documentation was provided. These variables span all types, including numerical.

Financial Information

These variables contain the financial information of the applicant. They include normalized scores (I believe these scores act like a credit score, but Home Credit does not state this in the column descriptions they provided) and the number of enquiries made to the Credit Bureau about the client in the period before the application. These variables are mainly numerical.

Employment

Home Credit has provided columns containing the applicant's annual income and field of work. These variables are of all types.

Property info

A fair number of variables contain data about the property the client lives in. These are numeric and have been normalized.

Regional Info

This group of variables describes the region where the client lives, including the normalized population and Home Credit's rating of the region.

Missing Data and Imputation

Most columns are missing less than 1% of their data, with the exception of the property info columns, many of which are missing the majority of their values; this may prevent me from using them. Other variables with a large share of missing values are the applicant's car age, occupation type, two of the normalized financial scores, and the Credit Bureau enquiry counts.

Records with null property values make up 75% of the data. Once the other frequently missing columns are included, almost every record contains at least one null. However, ignoring these columns, only 1% of the records would have a null value.

I am planning to drop the property columns from the dataset, and may revisit them later for additional feature engineering and data cleansing. I will also drop the applicant's car age, as 2/3 of its values are missing. The remaining columns have a significantly lower share of missing values.

For categorical columns with missing values, I will either choose one category as the default (a category that should not affect default probability) or create a new "unknown" category. For numeric columns, I will either use KNN imputation with 5 neighbors to fill the values or choose a default value. To make the imputation tractable, I selected a subset of columns to reduce computation time and scaled their values. With these changes, I no longer have any missing values in my training dataset.
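A minimal sketch of the numeric imputation step, using scikit-learn's `StandardScaler` and `KNNImputer` on toy data (the column names here are placeholders, not the real application_train columns):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the numeric column subset of application_train
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 3)), columns=["a", "b", "c"])
df.iloc[2, 0] = np.nan
df.iloc[5, 2] = np.nan

# Scale first so every feature contributes comparably to the distance
# metric; sklearn scalers ignore NaNs in fit and pass them through
scaled = StandardScaler().fit_transform(df)

# Fill each missing value from its 5 nearest neighbors
imputed = KNNImputer(n_neighbors=5).fit_transform(scaled)
```

After this step `imputed` contains no missing values.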

Data Re-structuring and Wrangling

Home Credit has also provided a file with past loan balances at other institutions for the loan applicant, titled bureau_balance.csv. I will aggregate this file by loan to get the count of months in which the loan was late on payment. I will then match these loans to the applicants using the data in bureau.csv. Applicants can have multiple loans, so for each applicant I will take the total count across all of their loans. Finally, I will add this total as a new variable in our training dataset by doing a left join on the applicant ID, where the left dataset is our original training dataset.
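The aggregation and joins above can be sketched with pandas on toy frames, using the competition's column names `SK_ID_CURR`, `SK_ID_BUREAU`, and `STATUS` (assuming the statuses "1" through "5" mark a late month):

```python
import pandas as pd

# Toy versions of bureau.csv (maps applicant to loans) and
# bureau_balance.csv (one row per loan per month)
bureau = pd.DataFrame({"SK_ID_CURR": [100, 100, 200],
                       "SK_ID_BUREAU": [1, 2, 3]})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 2, 3, 3],
    "STATUS": ["0", "1", "2", "C", "0"],
})

# Count late months per loan
late = (bureau_balance["STATUS"].isin(list("12345"))
        .groupby(bureau_balance["SK_ID_BUREAU"]).sum()
        .rename("LATE_MONTHS"))

# Roll the per-loan counts up to the applicant
per_applicant = (bureau.join(late, on="SK_ID_BUREAU")
                 .groupby("SK_ID_CURR")["LATE_MONTHS"].sum()
                 .reset_index())

# Left join onto the training data; applicants with no bureau
# history (e.g. 300) get a null to impute later
train = pd.DataFrame({"SK_ID_CURR": [100, 200, 300]})
train = train.merge(per_applicant, on="SK_ID_CURR", how="left")
```

The left join preserves every training row, so applicants without bureau records survive with a missing count rather than being dropped.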

New Variables

So far, I have created 3 new variables. The first two are the social circle default ratio columns. These are based on 4 columns in the original dataset that contain the number of observed people in the client's social circle along with the number who defaulted. By creating a ratio, we capture the true propensity of defaults in the client's social circle instead of relying only on the raw default count. These values are generated for both the 30-day and 60-day default definitions.
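A sketch of the ratio construction, using the dataset's original `OBS_*`/`DEF_*` column names (the new `DEF_*_RATIO` names are my own) and guarding against clients with an empty social circle:

```python
import numpy as np
import pandas as pd

# Toy frame with the four original social-circle columns:
# OBS_* = people observed, DEF_* = people who defaulted
df = pd.DataFrame({
    "OBS_30_CNT_SOCIAL_CIRCLE": [10, 0, 4],
    "DEF_30_CNT_SOCIAL_CIRCLE": [2, 0, 1],
    "OBS_60_CNT_SOCIAL_CIRCLE": [10, 0, 4],
    "DEF_60_CNT_SOCIAL_CIRCLE": [1, 0, 0],
})

for days in (30, 60):
    obs = df[f"OBS_{days}_CNT_SOCIAL_CIRCLE"]
    defs = df[f"DEF_{days}_CNT_SOCIAL_CIRCLE"]
    # No observed peers -> define the ratio as 0 instead of 0/0
    df[f"DEF_{days}_RATIO"] = np.where(obs > 0, defs / obs, 0.0)
```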

The other new column is the number of requests made to the Credit Bureau in the year before the application. In the original table, the request counts are split between the last hour, day, week, month, quarter, and year, with each column excluding the observations counted in another; for example, the yearly request count excludes the count from the last quarter. I will add these columns together into one annual total, as I don't believe the smaller periods contain enough observations to be useful. Finally, the column from the bureau aggregation will also be added to the training dataset.
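Because the periods are disjoint, the annual total is a simple row sum over the six request-count columns (the `_TOTAL` column name is my own):

```python
import pandas as pd

# The six disjoint Credit Bureau request-count columns
# from application_train
req_cols = ["AMT_REQ_CREDIT_BUREAU_HOUR", "AMT_REQ_CREDIT_BUREAU_DAY",
            "AMT_REQ_CREDIT_BUREAU_WEEK", "AMT_REQ_CREDIT_BUREAU_MON",
            "AMT_REQ_CREDIT_BUREAU_QRT", "AMT_REQ_CREDIT_BUREAU_YEAR"]

df = pd.DataFrame([[0, 0, 0, 1, 0, 2],
                   [0, 1, 0, 0, 2, 3]], columns=req_cols)

# One annual total per applicant
df["AMT_REQ_CREDIT_BUREAU_TOTAL"] = df[req_cols].sum(axis=1)
```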

Clean Data Overview

After our column drops and additions, we have 78 columns in our cleaned data, 52 of which are categorical and 26 numerical. Below is a view of the first 5 rows of this data.

Methods

Model List

  1. Explainable Boosting Machine
  2. Weighted Random Forest

Explainable Boosting Machine

Citations

InterpretML. "Explainable Boosting Machine". https://interpret.ml/docs/ebm.html

Harsha Nori et al. "InterpretML: A Unified Framework for Machine Learning Interpretability". https://arxiv.org/pdf/1909.09223.pdf

Yin Lou et al. "Intelligible Models for Classification and Regression". https://www.cs.cornell.edu/~yinlou/papers/lou-kdd12.pdf

Theory

Explainable Boosting Machine (EBM) is a generalized additive model (GAM), having the form:

$$g(E[y]) = \beta_{0} + \sum_{j} f_{j}(x_{j})$$

There are two major differences between an EBM and a standard GAM. First, each feature function $f_{j}$ is learned using techniques like bagging and gradient boosting. Second, automatic pairwise interaction detection is supported, so the EBM can be represented as:

$$g(E[y]) = \beta_{0} + \sum_{i} f_{i}(x_{i}) + \sum_{i,j} f_{i,j}(x_{i}, x_{j})$$

A full description of the algorithm can be found in "Intelligible Models for Classification and Regression".

Software Package

EBM can be found in InterpretML, a package containing several machine learning interpretability methods; I will be using the Python version. InterpretML was developed at Microsoft Research and released as open source. Its EBM is an implementation of the algorithm proposed by Lou et al. ("Intelligible Models for Classification and Regression") and follows scikit-learn's API. As part of its "interpretable" nature, visuals revealing the model's structure or explaining its output are easily created. Example code can be found in the "Model Results" section of this report.

Weighted Random Forest

Citations

Chen et al. "Using Random Forest to Learn Imbalanced Data". https://unomaha.instructure.com/courses/51016/pages/teaching-presentation

Leo Breiman. "Random Forests". https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf

Theory

A random forest classifier consists of multiple trees, each providing a prediction for a given input. Each tree is trained on a random subset of the variables and observations from the original input. When dealing with imbalanced data, random forests often ignore the minority class, as the standard error/objective functions impose little penalty for misclassified minority-class observations due to their low count. One way to prevent this is to penalize misclassifying the minority class more heavily, which can be done by assigning the minority class a large weight. The standard Gini function (the function that calculates the impurity of a tree node) for a binary problem looks like this:

$$1 - [(Ratio_{0})^2 + (Ratio_{1})^2]$$

A weighted version looks like this:

$$1 - [w_{0}(Ratio_{0})^2 + w_{1}(Ratio_{1})^2]$$
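A direct transcription of the weighted impurity above; with $w_{0} = w_{1} = 1$ it reduces to the standard Gini:

```python
def weighted_gini(ratio0, ratio1, w0=1.0, w1=1.0):
    """Weighted Gini impurity for a binary node.

    ratio0/ratio1 are the class proportions in the node;
    w0 = w1 = 1 recovers the standard (unweighted) Gini.
    """
    return 1.0 - (w0 * ratio0 ** 2 + w1 * ratio1 ** 2)
```

A pure node scores 0 and an even, unweighted split scores 0.5, matching the usual binary Gini values.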

Software Package

I used the implementation of the Random Forest Classifier in Python's scikit-learn package. scikit-learn comes with many different machine learning models, as well as functions to preprocess data and evaluate model results. Due to scikit-learn's well-known API, it was easy to create the model and generate predictions. Example code can be found in the "Model Results" section.

Results

Data Exploratory Analysis

We have 78 columns in our final dataset. I have selected a subset of these that showcase the kinds of relationships the independent variables have with our target variable.

Score

The score variables (column names EXT_SOURCE_1, 2, and 3) have the clearest relationship with the default rate. The graph below displays the default rate for different buckets of the second score. As the score increases, defaults become less common.

Other numerical columns

Unfortunately, most variables do not share the strong relationship that the score variables have with the default rate, but differences in the default rate can still be seen, as the plots below show.

Below is an example of a variable that I expected to have a relationship with the default rate, but did not appear to.

Categorical columns

Below are some plots with default rates for different categories. The code that creates these plots was mostly generated by mitosheet.

Model results

Weighted Random Forest

Below is the code to train a weighted random forest classifier and generate predictions.
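A minimal sketch of this setup, using scikit-learn's `class_weight="balanced"` option as the weighting mechanism and synthetic imbalanced data in place of the cleaned training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic ~9:1 imbalanced data standing in for the cleaned dataset
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)

# class_weight="balanced" up-weights the minority (default) class
# in the impurity calculations, per the weighted Gini above
rf = RandomForestClassifier(n_estimators=100,
                            class_weight="balanced",
                            random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)
```

Per-class precision and recall (e.g. via `sklearn.metrics.classification_report`) are the metrics to watch here, since raw accuracy hides the minority class.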

Here are some accuracy metrics:

The confusion matrix below reveals that the classifier isn't paying attention to the positive class.

Although I modified the random forest's parameters in many different ways, I was unable to raise the precision for the positive label while still producing reasonable results.

Explainable Boosting Machine (EBM)

Here are some accuracy metrics:

The precision is much higher than the random forest model's, and the confusion matrix below also reveals that the classifier is paying attention to the positive class. Accuracy may drop, but since we are dealing with an imbalanced dataset, we are willing to trade accuracy for correctly labeling the positive class.

The top 20 variables by importance can be found below.

Since EBM is a GAM, we can visualize the relationship between each independent variable and the target. Below are 3 of these plots.

One powerful feature of an EBM is the ability to see why an observation was assigned a label. Below are 3 examples of EBM's label justification. The first is a true negative. The applicant has great financial scores, and the price of whatever they are purchasing is a favorable amount as well, leading to a prediction of no default.

The second is a true positive. The awful value on the 3rd financial score led to a prediction of default.

In the 3rd example, although the financial scores looked bad and led to a default prediction, the loan applicant did not end up defaulting.

Further tuning of the EBM

I noticed that many of the relationships the EBM captured looked noisy. 2 examples can be seen below.

The default count vs. score chart looks very odd. When you look at the density, you notice that there are few observations greater than 10 (in reality, few are greater than 4). Below is the same relationship over a smaller range.

The count of children is another clear example of a strange relationship, more likely caused by the low number of observations at higher values than by a real trend. By default, EBMs require a minimum sample size of 2 for each leaf. I raised that number significantly to force the EBM to only create leaves backed by meaningful sample sizes.

When we raised the minimum number of samples per leaf from 2 to 1,000, we did not see a significant difference in accuracy.

Our default count relationship did not change much, so a large number of observations must be driving it.

The relationship between count of children and default is much less volatile.